Abstract: This paper studies methods of quantitatively measuring
semantic information in communication. We review existing
work on quantifying semantic information, then investigate
a model-theoretical approach for semantic data compression and
reliable semantic communication. We relate our approach to
the statistical measurement of information by Shannon, and
show that Shannon’s source and channel coding theorems have
semantic counterparts.

I.  Background 

It  has  long  been  recognized  that  the  broad  subject  of 
communication  goes  beyond  what  Shannon’s  theory  [54]  and 
many  of  its  extensions  [57]  cover.  Weaver  [60],  just  one  year 
after  Shannon  introduced  his  information  theory,  proposed  that 
communication  involves  problems  at  three  levels  as  follows: 
“LEVEL A. How accurately can the symbols of
communication be transmitted? (The technical problem.)

LEVEL B. How precisely do the transmitted symbols
convey the desired meaning? (The semantic
problem.)

LEVEL C. How effectively does the received meaning
affect conduct in the desired way? (The effectiveness
problem.)”

Shannon’s Classical Information Theory (CIT) is deliberately
focused on only Level A (the technical level); thus, “semantic
aspects of communication are irrelevant to the engineering
problem” [54]. As a metaphor, Weaver said that
“an engineering communication theory is just like a very
proper and discreet girl accepting your telegram. She pays
no attention to the meaning, whether it be sad, or joyous,
or embarrassing”. On the other hand, Weaver argued that
Shannon’s information theory is general enough to be extended
to consider communication on levels B and C, for instance,
by adding “semantic transmitter”, “semantic receiver” and
“semantic noise” to Shannon’s communication model. This
vision is illustrated in Figure 1.¹

A shorter version of this report has been published in the 2011 IEEE First
International Workshop on Network Science.

The  assumption  that  “semantics  is  not  relevant”  is  no 
longer  true  in  many  forms  of  modern  communications,  such 
as  in  database  queries,  distributed  systems,  human-computer 
interactions,  and  the  Web  (particularly  the  Semantic  Web  [7]). 
There  is  now  a  strong  need  for  an  extension  of  the  classical 
communication  model  to  characterize  not  only  sequences  of 
bits, but also the meanings behind these bits. For this goal,
various researchers have studied theories of “semantic information”
(details discussed in Section II). Notable examples
include  the  pioneering  work  of  Carnap  and  Bar-Hillel  [9], 
Floridi  [19,  20],  Barwise  and  Seligman  [4,  53],  among  others. 

However,  a  generic  model  of  semantic  communication,  as 
suggested  by  Weaver,  has  still  largely  remained  unexplored 
after  six  decades.  Existing  works  on  semantic  information 
are  limited  in  addressing  some  fundamental  questions  in 
communication  when  the  semantics  of  exchanged  contents 
is  no  longer  negligible.  Some  of  these  problems  include: 
How  can  semantics  help  in  data  compression  and  reliable 
communication?  How  are  semantic  coding/decoding  related  to 
the  engineering  coding/decoding  problems?  What  is  semantic 
noise? Are there achievable bounds in semantic coding, analogous
to the bounds established by Shannon in engineering
communication?  What  factors  should  we  consider  to  improve 
efficiency  and  reliability  in  semantic  communication? 

This  paper  summarizes  some  of  our  initial  work  in  realizing 
Weaver’s  vision,  by  extending  Shannon’s  theory  of  (technical) 
communication to a theory of Level B (semantic) communication.
Our work is influenced by Carnap and Bar-Hillel [9],
with  new  contributions  in  the  following  areas: 

• We show that the work of Carnap and Bar-Hillel is
a special case of a model-theoretical characterization
of semantic information sources, and present a generic
model of semantic communication;

• We discuss the role of semantics in reducing source
redundancy, and establish theoretical bounds in lossless
semantic data compression;

• We define the notions of semantic noise and semantic
channel. By extending Shannon’s channel coding
theorem, we obtain the semantic capacity of a channel.

Fig. 1. A 3-Level Communication Model

¹Local knowledge and shared knowledge in the diagram are not mentioned
by Weaver.

The  model  developed  in  the  paper  is  crude,  and  many 
non-trivial  simplifications  are  made.  Most  importantly,  the 
modeling  of  Level  C  (utility  or  effectiveness)  communication 
is beyond the scope of this paper. We also note that the logic-based
approaches we adopted may not be adequate to capture
semantics  in  human  communications.  However,  we  believe 
that  these  simplifications  are  necessary  for  us  to  focus  on  the 
“core”  issues  of  semantic  communication,  and  that  even  this 
crude  model  readily  yields  some  interesting  results.  We  believe 
this  model,  after  some  of  the  suggested  extensions,  may  form 
a  foundation  for  a  general  theory  of  semantic  communication. 

II.  Related  Work 

We first briefly review existing theories of semantic information.

Efforts to extend CIT to capture semantic aspects of communication
started shortly after Shannon published his paper.
Carnap and Bar-Hillel (1952) [9] were among the first to
introduce a “semantic information theory” (SIT). Their work
is henceforth referred to as Classical Semantic Information
Theory (CSIT).

They distinguish the concepts of information and the
amount of information, and measure the amount of information
in a sentence in a given language based on logical probabilities
(as opposed to the statistical probabilities used in CIT) ranging
over the contents. Intuitively, “A and B” has more information
than “A” because it is less likely to be true: whenever “A and
B” is true, “A” is true, but not vice versa. Similarly, “A” has
more information than “A or B”, and a tautology (which is
trivially true) provides no information.

The logical probability of a sentence, therefore, is measured
by the likelihood that the sentence is true, taken over all possible
situations. For instance, suppose “A” and “B” are independent
of each other, and both are true or false as a result of the
flip of a fair coin. There are 4 possible situations with equal
probabilities (i.e., 0.25):

•  A  is  false,  B  is  false 

•  A  is  false,  B  is  true 

•  A  is  true,  B  is  false 

•  A  is  true,  B  is  true 

Therefore, “A and B” is true in only the last situation and
its logical probability is 0.25. Similarly, the logical probability
of “A or B” is 0.75. These can be denoted using a function
$m$ as:

$$m(A \wedge B) = 0.25, \quad m(A \vee B) = 0.75$$

The amount of semantic information in a sentence $A$ is
defined as the negative logarithmic value of $m(A)$, i.e.,²

$$H_s(A) = -\log_2(m(A))$$

Thus, $H_s(A \wedge B) = 2$ and $H_s(A \vee B) = 0.415$, while
$H_s(A) = H_s(B) = 1$, matching the intuitions given above.

²Carnap and Bar-Hillel used inf instead of $H_s$.

It has been shown that logical inference does not provide
additional semantic information, that is:

$$A \vdash B \Rightarrow H_s(A) \geq H_s(B)$$

where $\vdash$ is the logical entailment relation. Therefore, equivalent
sentences contain the same amount of semantic information:

$$A \equiv B \Rightarrow H_s(A) = H_s(B)$$
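The CSIT quantities in this example can be checked by brute-force enumeration of state descriptions. The following is a minimal Python sketch, not part of the original work: the helper names `logical_probability` and `semantic_entropy` are hypothetical, and it assumes two independent atoms whose four truth assignments are equally likely, as in the coin-flip example.

```python
from itertools import product
from math import log2

ATOMS = ("A", "B")  # assumption: the two independent atoms of the example


def logical_probability(sentence):
    """m(sentence): fraction of equally likely state descriptions satisfying it."""
    worlds = [dict(zip(ATOMS, values))
              for values in product([False, True], repeat=len(ATOMS))]
    return sum(1 for w in worlds if sentence(w)) / len(worlds)


def semantic_entropy(sentence):
    """H_s(sentence) = -log2 m(sentence); infinite when there are no models."""
    prob = logical_probability(sentence)
    return float("inf") if prob == 0 else -log2(prob)


def a(w):
    return w["A"]


def a_and_b(w):
    return w["A"] and w["B"]


def a_or_b(w):
    return w["A"] or w["B"]


print(logical_probability(a_and_b), logical_probability(a_or_b))  # 0.25 0.75
print(semantic_entropy(a_and_b))                                  # 2.0
print(round(semantic_entropy(a_or_b), 3))                         # 0.415

# (A and B) entails A, which entails (A or B): entropy never increases.
assert semantic_entropy(a_and_b) >= semantic_entropy(a) >= semantic_entropy(a_or_b)
```

Enumeration is exponential in the number of atoms, so this only illustrates the definitions on small examples.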

Essentially,  CSIT  can  be  regarded  as  a  model-theoretical 
approach  to  assign  probabilistic  values  to  logical  sentences. 
Since  paper  [9]  is  limited  to  propositional  logic,  Carnap  and 
Bar-Hillel  use  truth  tables  (with  each  row  called  a  “state 
description”),  which  can  be  seen  as  the  universe  of  all  possible 
models  of  a  propositional  sentence,  to  find  the  chance  that  a 
sentence  is  true.  In  CSIT,  there  is  a  close  relationship  between 
the  quantity  of  information  in  a  sentence  and  the  set  of  its 
models.  If  a  consistent  sentence  has  fewer  models,  it  is  more 
“surprising”  and  contains  more  information.  This  is  similar  to 
the  probabilistic  logics  of  Nilsson  [47]  and  Bacchus  [2],  which 
can  be  extended  to  first-order  languages. 

In [19, 20], Floridi developed a Theory of Strongly Semantic
Information (TSSI). One of his major motivations is to
solve the so-called Bar-Hillel-Carnap Paradox (BCP) in CSIT,
which states that contradictions have an infinite amount of
information, i.e., $m(\bot) = 0$, thus $H_s(\bot) = \infty$, where $\bot$
is shorthand for $A \wedge \neg A$ for arbitrary $A$. The basic idea is
that  the  informativeness  of  a  statement  is  measured  by  the 
positive  or  negative  degree  of  semantic  distance  or  deviation 
from  “truth”.  This  is  quite  different  from  CSIT,  which  defines 
informativeness  as  a  function  over  all  situations,  not  over  a 
particular  situation  that  is  chosen  to  be  true. 

However,  it  has  been  noted  that  TSSI  is  incomplete  with 
regard  to  quantifying  all  possible  statements  [15].  There  exist 
propositional  sentences  that  cannot  be  evaluated  using  the 
approach described in [19]. For these reasons, D’Alfonso
([15]  section  4)  proposed  the  “value  aggregate”  method  that 
captures  both  inaccuracy  and  vacuity,  based  on  formal  models 
of  truthlikeness.  This  method  aggregates  the  differences  of  all 
models  of  a  sentence  to  those  of  the  “true”  state. 

Both Floridi and D’Alfonso’s approaches measure the relative
information or misinformation of a statement against
another reference statement assumed to be true. Thus, the
information value is always a value between 0 and 1. This
approach is rooted in the semantic information framework
using information flow and situation theory by Seligman and
Barwise [4, 53] and that of Devlin [17]. However, Floridi
and D’Alfonso’s approaches cannot determine the objective
amount of information when there is no reference statement.
Essentially, their work offered a semantic similarity
(or divergence) measurement between two sentences, not a
measurement of uncertainty as Shannon, Carnap and Bar-Hillel
proposed.

Several authors have investigated other approaches to modeling
semantic information, e.g., algebraic information theories
[35, 37], universal semantic communication [33, 34] and
semantic coding [61]. Some recent work has been collected in
two proceedings [44, 56]. However, these works do not offer
a quantitative measure of semantic information in inference-capable
sources, nor the study of the role of semantics in
coding, which are our main foci.

III.  Semantic  Communication:  a  General  Model 

Before we can investigate the measurement of semantic information,
we need to clearly define semantic information and
semantic  communication.  The  concept  of  semantic  information 
is  certainly  not  new.  Here  we  will  restrict  ourselves  to  the 
engineering  description  of  this  notion.  For  more  information 
about  the  philosophical  account  of  semantic  information,  see 
the  excellent  survey  in  [20]. 

A.  Goal  of  Semantic  Communication 

Note  that  there  is  a  fundamental  difference  between  the 
goal  of  engineering  communication  and  that  of  semantic 
communication.  Shannon  stated  in  his  paper  [54]  that 
The  fundamental  problem  of  communication  is  that 
of  reproducing  at  one  point  either  exactly  or  approx¬ 
imately  a  message  selected  at  another  point. 

Weaver  [60]  stated  that 

The  semantic  problems  are  concerned  with  the  inter¬ 
pretation  of  meaning  by  the  receiver,  as  compared 
with  the  intended  meaning  of  the  sender. 

Comparing  the  two  statements,  we  can  state  that  the  goal 
of  semantic  communication  is  not  to  reproduce,  exactly  or 
approximately, the messages transmitted, but their interpretations.
For example, consider the conversation:

Alice:  “Are  you  free  this  weekend?” 

Bob:  “No,  I’m  busy  on  both  Saturday  and  Sunday.” 

Alice  is  a  semantic  source  (sender)  and  Bob  is  a  semantic 
destination (receiver). Bob is able to interpret the meaning
of the received message and relate it to the meanings of
the  vocabulary  he  already  knows.  He  knows  that  “free”  is  an 
antonym  of  “busy”  and  that  “weekend”  means  “Saturday”  or 
“Sunday”.  He  is  able  to  infer  that  “free  this  weekend”  is  the 
same  as  “not  busy  on  both  Saturday  and  Sunday”,  even  if  the 
two  statements  are  syntactically  different. 

For  a  classical  information  source,  a  message  is  a  sequence 
of  symbols.  In  a  semantic  information  source,  a  message, 
which  may  still  be  syntactically  viewed  as  a  sequence  of 
symbols,  is  in  fact  an  expression  composed  using  the  symbols 
in  the  language  of  the  source.  What  we  want  to  achieve  is 
the  faithful  transmission  of  meanings  of  these  expressions, 
not  their  syntactic  representations,  which  is  the  concern  of 
engineering  communication. 

Now  consider  a  conversation  between  three  persons: 

Alice:  “Bob,  is  Charlie  free  this  weekend?” 

Bob:  “Charlie,  Alice  asks  if  you  are  available  this 
weekend?” 

Charlie:  “No,  I’m  not  available  on  both  Saturday  and 
Sunday.” 
Here Bob serves as a semantic channel between Alice
and Charlie. Bob does not faithfully convey the original
message from Alice; however, he is still able to preserve
the original meaning of the sender’s message. There
may be an engineering failure if we measure the success of
communication “literally”, but there is no semantic failure.

Even if there is no engineering communication failure, there
may still be a semantic communication failure. Consider a
conversation about a “Lecturer” in universities: a US person
who is not familiar with UK academic ranks may interpret it
to be similar to a non-tenure-track position in the US, whereas
“Lecturer” in the UK is roughly equivalent to “Assistant
Professor” in the US system.

B.  Semantic  Sources 

A  real  world  semantic  source  may  be  a  complicated  system 
which  can  make  statements  with  subtle  semantic  distinctions. 
In  this  paper,  we  will  not  try  to  model  every  form  of  semantic 
source,  but  a  very  basic  type  that  can  make  factual  statements 
in  propositional  logic.  This  simplification  will  help  us  focus  on 
the  key  modeling  problem,  and  we  will  discuss  its  extensions 
later. 

For a hypothetical example, suppose a child asks her father
what “Tweety” is. The father, as an information source, may
do  the  following: 

•  (Observing  World)  He  searches  the  Web  and  finds  a 
webpage  about  Tweety.  There  are  many  such  pages.  Most 
of  them  are  about  Tweety  the  bird,  but  a  few  are  about  a 
Twitter  client,  or  a  basketball  player. 

•  (Inferring)  Depending  on  which  page  the  father  visits  and 
trusts,  the  father  may  use  his  knowledge  to  come  up 
with  an  appropriate  answer  for  his  child.  For  instance, 
the  webpage  may  tell  him  that  “Tweety  is  a  canary”,  but 
since  the  child  may  not  understand  “canary”  yet,  and  the 
father  knows  that  canaries  are  birds,  he  may  infer  that 
“Tweety  is  a  bird”. 

•  (Transmitting)  The  father  most  likely  answers  his  child  in 
English  that  “Tweety  is  a  bird”,  but  there  is  some  positive 
probability  that  he  instead  answers  “Tweety  is  software” 
or  “Tweety  is  a  man”. 

For the message “Tweety is a bird”, the unit of symbols is
English words. Thus, from a non-semantic (syntactic) point of
view, the message is a sequence of 4 symbols. Its classical
information can be approximately determined by the frequencies
of English words.

Now, we regard this message as a semantic message, e.g.,
a human-friendly coding of the proposition $bird_{Tweety}$. The
source states it because the source believes that it is “true”
w.r.t. its observations about the world³. On the other hand,
whether a message is true or not is irrelevant in classical
information theory.

Informally, we say a semantic source is an entity that can
emit messages using a given syntax, such that these messages
are “true” in the source, according to its state and inference
capabilities.

³It is possible that a source intentionally sends out wrong messages to
deceive the destination. However, we believe that such situations should be
studied as Level C communication, not as Level B (semantic).

C.  A  Semantic  Communication  Model 

What, then, is semantic communication? When a semantic
information  source  (e.g.,  the  father  in  the  example  above) 
sends  a  message,  the  source  expects  the  destination  (e.g., 
the  child)  to  “understand”  the  message  to  some  degree.  The 
destination,  thus,  rather  than  mechanically  decoding  the  syntax 
of  the  message,  will  be  able  to  draw  conclusions  from  the 
received  message,  as  well  as  from  its  current  local  knowledge. 
In  the  above  example,  the  child,  after  learning  that  “Tweety  is 
a  bird”,  may  infer  that  “Tweety  is  an  animal”,  if  her  knowledge 
base  tells  her  that  “birds  are  animals”. 

Figure  2  characterizes  a  model  of  semantic  communication 
we  will  use  in  this  paper.  Formally,  a  semantic  information 
source  is  a  tuple  (Ws,  Ks,  Is,  Ms),  where 

•  Ws  is  the  model  of  worlds  potentially  observable  by  the 
source; 

•  Ks  is  the  background  knowledge  base  of  the  source; 

•  Is  is  the  inference  procedure  used  by  the  source; 

•  Ms  is  the  message  generator  used  by  the  source  to  encode 
a  message. 

In  this  model,  the  source  builds  its  own  world  model  by 
observing  the  outside  world.  In  the  “Tweety”  example,  the 
world is observable using a search engine. In this generic
model, we do not specify how the world is represented, nor
the kind of semantic relations between the world model and
the messages. There are several different ways this may be
done,  e.g.,  by  using  model-theoretic  semantics,  operational 
semantics,  lexical  semantics,  or  by  many  forms  of  cognitive 
models  of  semantics  [14]. 

The  message  generator  (or  semantic  encoder)  generates 
messages  according  to  defined  strategies.  Since  usually  there 
are  many  different  but  semantically  valid  ways  to  describe  one 
situation,  the  message  generator  has  great  freedom  in  picking 
a  “good”  code.  For  instance,  the  generator  may  send  messages 
that  are  most  accurate,  or  that  are  easy  to  generate  (according 
to  some  cost  function),  or  that  the  destination  is  most  interested 
in.  Also,  similar  to  the  engineering  transmitter,  the  message 
generator  may  deal  with  both  how  to  reduce  redundancy  in 
messages  (source  coding),  and  how  to  improve  the  reliability 
of  the  transmission  (channel  coding). 

Possible  outputs  of  the  message  generator  can  be  seen  as 
an  interface  language  for  the  source.  For  instance,  regarding  a 
graph,  one  interface  language  may  be  the  reachability  between 
nodes;  another  may  be  minimal  distances  between  nodes. 
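To make the notion of an interface language concrete, the sketch below exposes one hypothetical world model (a small directed graph) through the two interface languages mentioned above: reachability statements and minimal-distance statements. The graph, the function names, and the choice of hop count as the distance are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical world model: a small directed graph as adjacency lists.
GRAPH = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}


def distances_from(graph, start):
    """Breadth-first search returning minimal hop counts from `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def reachability_message(graph, u, v):
    """Interface language 1: a yes/no statement 'v is reachable from u'."""
    return v in distances_from(graph, u)


def distance_message(graph, u, v):
    """Interface language 2: the minimal distance from u to v, or None."""
    return distances_from(graph, u).get(v)


print(reachability_message(GRAPH, "a", "c"))  # True
print(distance_message(GRAPH, "d", "c"))      # 3
print(distance_message(GRAPH, "c", "a"))      # None
```

The second language is strictly more informative here: every distance message determines the corresponding reachability message, but not the other way around.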

The generated message will be transmitted over a conventional
(i.e., non-semantic) channel, in which a conventional
transmitter  and  a  conventional  receiver  will  take  care  of  the 
engineering  coding/decoding  tasks. 

Analogous to the source, a semantic information destination
(receiver) is a tuple (Wr, Kr, Ir, Mr), where

• Wr is the world model of the receiver;

• Kr is the background knowledge base of the receiver;

• Ir is the inference procedure used by the receiver;

• Mr is the message interpreter (semantic decoder).

[Figure: a sender (world model built from observations, background
knowledge, inference procedure, message generator Ms) and a receiver
(world model, background knowledge, inference procedure, message
interpreter) exchange messages {m} in the message syntax M, with a
possible feedback link.]

Fig. 2. Semantic Information Source and Destination

A  semantic  communication  error  occurs  if  the  message  to 
be  sent  is  “true”  at  the  source  (w.r.t.  Ws,  Ks  and  Is),  but  the 
received  message  is  “false”  at  the  destination  (w.r.t.  Wr,  Kr 
and  Ir).  The  error  may  be  due  to  losses  in  source  coding,  noise 
in  the  channel,  losses  in  decoding,  or  their  combinations. 

Note  that  background  knowledge  and  inference  procedures 
may be fully or partially shared by the source and the destination
in semantic communication. It is possible for them to
use  different  background  knowledge  or  inference  rules,  which 
may  lead  to  different  truth  evaluations  and,  hence,  semantic 
mismatches.  There  may  also  be  feedback  channels  from  the 
destination  to  the  source.  The  source,  channel  and  destination 
all  may  have  memories  (e.g.,  a  Markov  source),  or  may  be 
continuous.  To  simplify  discussion,  we  leave  these  extensions 
for  future  work. 
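The source and destination tuples and the definition of a semantic communication error can be sketched as a small data structure. This is only an illustration under strong assumptions that are not part of the model above: the world model is reduced to one observed set of true atoms, the knowledge base is a list of atom-to-atom implications, inference is forward chaining, and an atom that cannot be derived counts as false (a closed-world reading). The class name `Agent`, the function `semantic_error`, and the atom names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """One endpoint: observed world W, knowledge base K, inference I."""
    world: set                                  # atoms observed to be true
    rules: list = field(default_factory=list)   # K: (premise, conclusion) pairs

    def closure(self):
        """I: forward chaining until no new atom can be derived."""
        known = set(self.world)
        changed = True
        while changed:
            changed = False
            for premise, conclusion in self.rules:
                if premise in known and conclusion not in known:
                    known.add(conclusion)
                    changed = True
        return known

    def holds(self, atom):
        """Closed-world truth test (assumption): true iff derivable."""
        return atom in self.closure()


def semantic_error(source, sent, destination, received):
    """True iff `sent` is true at the source but `received` is false at the destination."""
    return source.holds(sent) and not destination.holds(received)


father = Agent(world={"canary_tweety"}, rules=[("canary_tweety", "bird_tweety")])
child = Agent(world={"bird_tweety"}, rules=[("bird_tweety", "animal_tweety")])

print(semantic_error(father, "bird_tweety", child, "bird_tweety"))      # False
print(semantic_error(father, "bird_tweety", child, "software_tweety"))  # True
print(child.holds("animal_tweety"))                                     # True
```

In the second call the message is altered in transit, so it is true at the source yet evaluates to false at the destination, which is the error condition defined above.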

IV.  Measuring  Semantic  Information  and 
Semantic  Data  Compression 

Now  we  discuss  the  general  principles  of  measuring  the 
amount  of  semantic  information  of  sources,  and  the  role  of 
semantics in data compression (source coding). A model-theoretic
semantics is studied in this and the next section, but
we  note  that  this  is  not  the  only  possible  approach  in  realizing 
our  generic  semantic  communication  model. 

A.  Entropy  of  Semantic  Messages 

In CIT, the entropy of a message is determined by the
statistical probability of the symbols appearing in it. In CSIT, the
entropy of a statement is determined by its logical probability,
i.e., the likelihood of observing a possible world (model)
in which this statement is true. To see the difference, consider
the message “Rex is not a tyrannosaurus” (M1), which is
less “surprising” than “Rex is not a dog” (M2), not because
the word “tyrannosaurus” is more common than “dog”, but
because the individuals represented by “tyrannosaurus” (now
considered extinct) are less common than the individuals
represented by “dog”. Thus, M1 has less semantic information
than M2, even if it may have more Shannon information based
on the statistical distribution of English words.

As  another  example  of  a  semantic  information  source  in  a 
broader  sense,  information  carried  by  DNA  is  encoded  using 
a four-letter alphabet (bases A, G, C, T). DNA’s syntactical
entropy can be obtained using statistical studies of bases or
sequences of bases, with estimates ranging from 1.6 to 1.9
bits  per  base  [3,  41,  51].  However,  the  “semantics”  of  DNA  is 
only  expressed  after  a  complex  process,  producing  functional 
gene  products  such  as  RNAs  or  proteins.  The  process  is  not 
yet  fully  understood,  but  it  has  been  observed  that  variations  of 
DNA  do  not  necessarily  result  in  different  gene  products  [59], 
nor  will  DNA  be  expressed  in  exactly  the  same  way  under 
different  conditions  [45,  46].  If  we  measure  the  amount  of 
information  carried  in  a  DNA  molecule  based  on  its  functional 
gene  products,  our  conjecture  is  that  it  might  be  different  from 
the  DNA’s  syntactical  entropy. 

Below,  we  define  semantic  entropy,  following  and  extending 
the  CSIT  approach.  For  simplicity,  as  in  [9],  we  restrict  our 
discussion  to  propositional  logic. 

We  assume  that  the  source  has  the  following  properties: 

•  The  world  model  Ws  is  a  set  of  interpretations  with 
a  probability  distribution  p.  For  propositional  logic,  an 
interpretation  is  a  set  of  positive  propositions. 

•  The  inference  procedure  Is  is  a  satisfiability  reasoner  for 
propositional  logic. 

• The message generator Ms generates messages by some
fixed coding strategy, such that if the observed value of
the world model is $w$ and it generates a message $x$, it
must be the case that $w \models x$ (verified by Is), where $\models$ is
the usual propositional satisfaction relation.

We will omit the subscript s when there is no confusion.
Let $H(W)$ be the Shannon entropy of $W$, i.e.,

$$H(W) = -\sum_{w \in W} p(w) \log_2 p(w)$$

If the source is a classical source with $W$ as the symbol set,
$H(W)$ will be precisely the entropy of the source. We call
$H(W)$ the model entropy of the semantic source.

For a message (sentence) $x$, let $W_x$ be the set of its models,
i.e., worlds in which $x$ is “true”, $W_x = \{w \in W \mid w \models x\}$.
Note that, unlike CSIT, which relies on counting models of
a sentence, when interpretations have different probabilities,
what matters is the total probability of models of the sentence,
not the cardinality of the set of models. Then, the logical
probability of a message (sentence) $x$ is

$$m(x) = \frac{p(W_x)}{p(W)} = \frac{\sum_{w \in W,\, w \models x} p(w)}{\sum_{w \in W} p(w)}$$

Since $p$ is a probability measure, when $W$ is not constrained
by the background knowledge, $\sum_{w \in W} p(w) = 1$.

As in CSIT, we define the semantic entropy of $x$ as

$$H_s(x) = -\log_2(m(x))$$

Carnap and Bar-Hillel [9] gave some justifications for using
the logarithm in their definition. The measurement satisfies some
common-sense requirements for measuring semantics. For
propositional logic, we observe:

• $H_s(A \wedge B) \geq H_s(A)$

• $H_s(A \vee B) \leq H_s(A)$

• $A \vdash B \Rightarrow H_s(A) \geq H_s(B)$

• $H_s(A \vee \neg A) = 0$
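The difference between counting models and summing their probabilities can be illustrated with a non-uniform world model. The following Python sketch is an illustration only: the distribution `P_WORLD` and the function names are assumptions, chosen so that all values are exact in floating point.

```python
from math import log2

# Hypothetical non-uniform distribution p over worlds (A, B).
P_WORLD = {
    (False, False): 0.5,
    (False, True): 0.25,
    (True, False): 0.125,
    (True, True): 0.125,
}


def m(sentence, p=P_WORLD):
    """Logical probability: total probability of the models of `sentence`."""
    models = sum(prob for world, prob in p.items() if sentence(*world))
    return models / sum(p.values())


def H_s(sentence, p=P_WORLD):
    """Semantic entropy -log2 m(x); infinite when the sentence has no models."""
    prob = m(sentence, p)
    return float("inf") if prob == 0 else -log2(prob)


print(H_s(lambda a, b: a))                  # 2.0  (two models, total mass 0.25)
print(H_s(lambda a, b: not a and not b))    # 1.0  (one model, total mass 0.5)
print(H_s(lambda a, b: a and b))            # 3.0
print(H_s(lambda a, b: a or b))             # 1.0
```

Here "not A and not B" has fewer models than "A" yet carries less semantic information, because its single model is more probable than the two models of "A" combined. The conjunction and disjunction values also agree with the listed properties.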


B.  Conditional  Entropy  and  Background  KB 

CSIT is concerned with inferring the logical probability (thus,
semantic information) of a propositional expression when

• There is no background knowledge

• The propositions are independent of each other

In this subsection, we relax these two assumptions. When
there is a background knowledge base $K$, the set of possible
worlds will be restricted to the set compatible with $K$. The
semantic entropy of a sentence is then defined through a conditional
logical probability:

$$m(x|K) = \frac{\sum_{w \in W,\, w \models K, x} p(w)}{\sum_{w \in W,\, w \models K} p(w)}$$

and

$$H_s(x|K) = -\log_2 m(x|K)$$

For a simple example, suppose⁴ $p(A) = p(B) = 0.5$, $A$ and $B$ are
independent, and we have the background knowledge
$K = \{A \rightarrow B\}$. The truth table is


case   A   B   A → B   probability
1      0   0   1       0.25
2      0   1   1       0.25
3      1   0   0       0.25
4      1   1   1       0.25

⁴We always use p to represent statistical probabilities, and m for logical
probabilities.
Then the universe of possible worlds “shrinks” to the set of
truth assignments in which $A \rightarrow B$ is true, i.e., cases 1, 2 and
4. Therefore, we now have conditional logical probabilities

$$m(A|K) = 1/3$$

$$m(B|K) = 2/3$$

$$m(A \wedge B|K) = 1/3$$

Logical probabilities are different from a priori statistical
probabilities due to the presence of background knowledge.
In the new distribution, $A$ and $B$ are no longer logically
independent (as $m(A|K)\,m(B|K) \neq m(A \wedge B|K)$).
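The conditional logical probabilities of this example can be reproduced exactly with rational arithmetic. The sketch below is illustrative; the names `PRIOR`, `knowledge` and `m_given_k` are hypothetical, and it assumes the uniform prior over the four truth assignments used in the example.

```python
from fractions import Fraction
from itertools import product

# Prior p: four equally likely worlds (A, B), as in the example.
PRIOR = {world: Fraction(1, 4) for world in product([False, True], repeat=2)}


def knowledge(a, b):
    """Background knowledge K = {A -> B}."""
    return (not a) or b


def m_given_k(sentence, prior=PRIOR, k=knowledge):
    """Conditional logical probability m(x|K) over worlds compatible with K."""
    compatible = {w: pr for w, pr in prior.items() if k(*w)}
    satisfied = sum((pr for w, pr in compatible.items() if sentence(*w)),
                    Fraction(0))
    return satisfied / sum(compatible.values())


m_a = m_given_k(lambda a, b: a)
m_b = m_given_k(lambda a, b: b)
m_ab = m_given_k(lambda a, b: a and b)

print(m_a, m_b, m_ab)       # 1/3 2/3 1/3
print(m_a * m_b == m_ab)    # False: A and B are dependent given K
```

Using `Fraction` avoids rounding, so the failed independence check (2/9 versus 1/3) is exact rather than a floating-point artifact.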

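Let $p'$ be the new distribution of the set of models when $K$
is present, that is,

$$p'(w) = \frac{p(w)}{\sum_{v \in W,\, v \models K} p(v)}$$

for every $w \models K$. The model entropy of the source given $K$ is then

$$H(W|K) = -\sum_{w \in W,\, w \models K} p'(w) \log_2(p'(w))$$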

The model entropies of the source in the example without
and with the background knowledge are

$$H(W) = -4 \times 0.25 \log_2(0.25) = 2$$

$$H(W|K) = -3 \times \tfrac{1}{3} \log_2(\tfrac{1}{3}) = 1.585$$

It seems that the presence of background knowledge reduces
the informativeness of the source. This is true when the source
does not share background knowledge with the destination.
However, if the background knowledge is shared, the reduction
in semantic entropy means that we can compress the source
without losing information. In general, with the help of shared
background knowledge, we will be able to communicate with
shorter messages to achieve the maximal informativeness of
the source. In the example above, this means that state descriptions
(the most informative messages) need only 1.585 rather
than 2 bits to describe. The 21% saving is the contribution of
the shared background knowledge in compressing the source.
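The two model entropies and the resulting saving can be recomputed directly. This is a small sketch under the assumptions of the example (uniform prior over four worlds, K = {A -> B}); the variable names are illustrative.

```python
from itertools import product
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


# Uniform prior over truth assignments (A, B).
prior = {world: 0.25 for world in product([False, True], repeat=2)}

# Keep only the worlds compatible with K = {A -> B}, then renormalize.
compatible = {w: p for w, p in prior.items() if (not w[0]) or w[1]}
total = sum(compatible.values())
posterior = {w: p / total for w, p in compatible.items()}

H_W = entropy(prior)
H_W_K = entropy(posterior)
saving = 1 - H_W_K / H_W

print(H_W)               # 2.0
print(round(H_W_K, 3))   # 1.585
print(round(saving, 2))  # 0.21
```

The saving is realized only if the destination also knows K; otherwise the shorter state descriptions cannot be decoded back to worlds.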

C.  Semantic  Source  Coding 

For a propositional logic with a finite number $n$ of propositions, the
number of all possible interpretations (worlds) is finite ($2^n$).
The number of all possible messages (syntactically valid
propositional expressions), however, may be infinite if the
length of messages is not restricted. Since an interpretation
in general cannot uniquely determine messages, a semantic
coding strategy is necessary.

For an information source of engineering interest, the number
of all possible messages is in general only finite, or
is restricted in other ways. The interface language of the
source thus only allows a subset of all possible messages. For
example, a Twitter post is limited to 140 characters, and a G-rated
movie cannot contain scenes unsuitable for children. For
a given interface language, a semantic coding strategy needs
to achieve two potentially conflicting goals:

• Maximizing expected faithfulness in representing observed
worlds;

•  Minimizing  expected  coding  length. 

Let X be a finite set of allowed messages. A semantic
coding strategy is a conditional probabilistic distribution
$P(X|W)$. A deterministic coding is a special case of coding,
where each $w \in W$ has at most one possible coded message.
Given $\mu(W)$ and $P(X|W)$, the distribution of expressed
messages $P(X)$ can be determined using

$$P(x) = \sum_{w} \mu(w) P(x|w)$$

Let us define $H(X)$ as the Shannon entropy of messages
X with the distribution $P(X)$, i.e.,

$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$$

The  following  theorem  establishes  the  relation  between  the 
model  (semantic)  entropy  and  the  message  (syntactic)  entropy 
of  a  source: 

Theorem 1: $H(X) = H(W) + H(X|W) - H(W|X)$.

Proof  sketch:  By  definitions  of  entropy  and  conditional 
entropy. 

Intuitively, $H(X|W)$ measures semantic redundancy of the
coding, and $H(W|X)$ measures semantic ambiguity of the
coding. The theorem states that message entropy can be
larger or smaller than model entropy, depending on whether
redundancy or ambiguity is larger.
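The identity in Theorem 1 can be checked numerically on a toy source. The sketch below is only an illustration: the three worlds, their distribution `mu`, and the encoder table are hypothetical, chosen so that one world has two possible messages (redundancy) and two worlds share one message (ambiguity).

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical source: model distribution mu(w) and encoder P(x|w).
mu = {"w1": 0.5, "w2": 0.25, "w3": 0.25}
encoder = {
    "w1": {"a": 0.5, "b": 0.5},   # redundancy: one world, two messages
    "w2": {"c": 1.0},
    "w3": {"c": 1.0},             # ambiguity: two worlds, one message
}

# Joint p(w, x) = mu(w) P(x|w) and message marginal P(x) = sum_w mu(w) P(x|w).
joint = {(w, x): mu[w] * p for w, row in encoder.items() for x, p in row.items()}
p_x = {}
for (w, x), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Conditional entropies computed directly from their definitions.
h_x_given_w = sum(mu[w] * entropy(encoder[w]) for w in mu)
posterior = {
    x: {w: joint[(w, x2)] / p_x[x] for (w, x2) in joint if x2 == x}
    for x in p_x
}
h_w_given_x = sum(p_x[x] * entropy(posterior[x]) for x in p_x)

h_w, h_x = entropy(mu), entropy(p_x)
print(h_x, h_w + h_x_given_w - h_w_given_x)
```

In this toy case redundancy and ambiguity are both 0.5 bits, so message entropy and model entropy coincide at 1.5 bits.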

When $H(X) < H(W)$, there is an information loss
($H(W) - H(X)$). Sometimes, the loss in coding is an intentional
and desired compression of the source. For instance,
textual  description  of  an  image  gives  only  a  semantic  abstract 
of  the  image.  A  temperature  report  about  a  city  usually 
gives  only  an  average  value,  hiding  detailed  reports  from 
participating  temperature  monitoring  stations. 

We can also view the model entropy of a semantic information
source as the maximal expected (message) entropy
per message without redundancy. Let $H_{max}$ be the maximal
message entropy of a source, then


When there is no redundancy, no pair of messages shares
a model. Also, since $m(x) \ge \mu(w)$ for any $w \models x$,
therefore

$$H_{max} = -\sum_{w \in W} \mu(w) \log_2 \mu(w) = H(W)$$

Such maximality is reached when the messages are descriptions
of the models themselves. In the case of CSIT, this means
that  a  most  informative  coding  will  always  give  the  full  state 
description. 


D.  Use  Semantics  for  Data  Compression 

Some extensions of CIT exploit side information, i.e., the
receiver’s prior knowledge about the sender, to reduce the length
of  the  code.  Classical  results  in  this  area  [55]  describe  how  to 
achieve  optimal  coding  with  respect  to  the  joint  entropy  of  the 
source  and  the  side  information.  In  semantic  communication, 
shared  knowledge  and  inference  procedures  may  act  as  a 
special  kind  of  side  information  to  improve  coding  efficiency 
(i.e.,  compression).  On  the  other  hand,  unlike  in  CIT,  semantic 
side  information  is  not  represented  as  distributions,  but  as 
logical  statements  and  inference  procedures. 

With the presence of semantics, some messages may be
semantically equivalent to other messages, and if the equivalence
is captured by shared knowledge, this can be used to
compress the source. For example, $a \to (a \wedge b) \vee (b \wedge c)$ can
be reformulated as $a \to b$. If a message has many equivalent
forms, we can pick a subset of the forms, hence reducing the
entropy of the source without a “real” (semantic) loss.
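The equivalence claimed in this example can be confirmed by brute force over all truth assignments. The following is a minimal sketch; the helper names `implies` and `equivalent` are introduced here for illustration only.

```python
from itertools import product

def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q

def equivalent(f, g, n_vars):
    """True if f and g agree on every assignment of n_vars Boolean variables."""
    return all(f(*v) == g(*v) for v in product([False, True], repeat=n_vars))

# a -> (a AND b) OR (b AND c), and its shorter reformulation a -> b.
long_form = lambda a, b, c: implies(a, (a and b) or (b and c))
short_form = lambda a, b, c: implies(a, b)

print(equivalent(long_form, short_form, 3))
```

A truth-table check of this kind costs $2^n$ evaluations for n propositions, so it is only practical for small vocabularies.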

To what extent is semantic compression possible? For a
source with a message interface language X and message
distribution $P(X)$, let $\hat{X}$ be the smallest subset of X such
that

$$\forall x \in X,\ \exists \hat{x} \in \hat{X} \ \text{s.t.}\ x \leftrightarrow \hat{x}$$

and

$$P(\hat{x}) = \sum_{x \in X \ \text{s.t.}\ x \leftrightarrow \hat{x}} P(x)$$

For a message x in X, $\hat{x}$ is its unique semantic normal form
in $\hat{X}$. The next theorem states the bound for lossless semantic
compression.

Theorem 2: For a semantic source with interface language
X, there exists a coding strategy to generate a semantically
equivalent interface language X' with message entropy
$H(X') \ge H(\hat{X})$. No such X' exists with message entropy
$H(X') < H(\hat{X})$.

Proof  sketch:  The  existence  part  is  trivial.  The  non-existence 
part  is  shown  by  the  uniqueness  of  semantic  equivalent  normal 
forms. 
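To make the bound concrete, the sketch below merges semantically equivalent messages into one normal form per equivalence class and compares the entropy before and after. It is an illustration under stated assumptions: the four messages and their probabilities are hypothetical, and a truth table stands in for the semantic normal form.

```python
import math
from collections import defaultdict
from itertools import product

def truth_table(formula, n_vars):
    """Tuple of truth values over all assignments; identifies the equivalence class."""
    return tuple(formula(*v) for v in product([False, True], repeat=n_vars))

def entropy(probs):
    """Shannon entropy (bits) of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical interface language over propositions a, b with message probabilities.
messages = {
    "a -> b":    (lambda a, b: (not a) or b, 0.25),
    "~a | b":    (lambda a, b: (not a) or b, 0.25),
    "~(a & ~b)": (lambda a, b: not (a and not b), 0.25),
    "a & b":     (lambda a, b: a and b, 0.25),
}

# The probability of a normal form is the sum over its equivalent messages.
classes = defaultdict(float)
for formula, p in messages.values():
    classes[truth_table(formula, 2)] += p

h_x = entropy(p for _, p in messages.values())
h_x_hat = entropy(classes.values())
print(h_x, h_x_hat)
```

Here three of the four messages collapse into one class, so the message entropy drops from 2 bits to about 0.81 bits without any semantic loss.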

The difference $H(X) - H(\hat{X})$ can be a large reduction
if the redundancy in semantically equivalent messages is
large. For example, a formula in full disjunctive normal form
with j different clauses and k different propositions has at
least $k!\,j!$ semantically equivalent forms. For propositional
logic with n propositions, $2^{2^n}$ semantic equivalence classes
of messages exist. If $2^{2^n} < |X|$ ($|X|$ is the cardinality of
the set of messages), the reduction can be significant. For
example, suppose our vocabulary allows only connectives $\vee$, $\wedge$
and n proposition names. A propositional message can be
represented with a grammar tree with internal nodes labeled
with connectives and leaves being propositions. A grammar
tree of depth d (thus, with message length $O(2^d)$) may have
$2^{d(d+1)/2} n^{2d}$ possible variations. Thus, for

$$d > \sqrt{(1/2 + 2 \log n)^2 + 2^{n+1}}$$

$2^{2^n} < |X|$ is true. This translates into a message length limit
of $O(2^{2^{(n+1)/2}})$ or larger.
Other semantic data compression strategies may be explored.
One possible approach is to reduce the model entropy
of a source, e.g., instead of measuring all models,
measure only minimal models [39]. When some semantic
infidelity is allowed, lossy semantic coding strategies may be
used based on semantic similarity between messages (e.g.,
“black” $\to$ “dark”).

E.  Implementation 

A semantic information calculator has been provided$^5$. The
calculator is able to calculate the semantic entropy of a
propositional message and the model entropy of a propositional
semantic information source.

V.  Semantic  Noise  and  Channel  Coding 
A.  Semantic  Noise 

For  communication  over  a  noisy  channel,  the  received 
message  may  contain  errors.  The  noise  may  be  added  either 
at  the  engineering  level  or  at  the  semantic  level.  Below  are 
some  examples  of  semantic  infidelity  in  communication: 

•  The meaning of a message is changed due to transmission
errors, e.g., from “copy machine” to “coffee machine”;

•  Translation  of  one  natural  language  into  another  language 
where  some  concepts  in  the  two  languages  have  no 
precise  match; 

•  The source uses English units, while the receiver understands
it using metric units (e.g., during the loss of the
Mars Climate Orbiter$^6$);

•  A message may be misunderstood due to cultural differences,
as demonstrated in the incidents of UK Prime
Minister David Cameron’s wearing of a poppy during a
visit to China$^7$ and in Karen Hughes’ speech on women’s
rights to a non-US audience [12].

A  key  difference  between  engineering  communication  and 
semantic  communication  is  how  infidelity  is  handled.  Let 
X  be  the  input  of  the  channel  and  Y  be  the  output  of 
the  channel.  In  engineering  communication,  the  goal  is  to 
minimize  the  expected  difference  between  X  and  Y ,  and  a 
particular  mapping  x  — >  y  (x  is  a  value  of  X  and  y  is  a  value 
of  Y )  is  either  a  match  or  not.  In  semantic  communication, 
we  are  concerned  with,  instead  of  syntactic  preservation  of  the 
message,  the  semantic  similarity  between  the  input  and  output 
messages. Also note that not all syntactic errors will lead to
semantic errors. Suppose that the input message is $X_1 \to X_2$
and the received message is $X_2 \vee \neg X_1$; then there is no semantic
loss. Thus, the semantic effect of noise may be lower than its
impact on syntax transmission due to the presence of semantic
redundancy.

In this paper we will not address communication failures
due to culture, contexts, background knowledge, default
assumptions, or other factors that may influence effectiveness
(“Level C”) of communication. A general discussion of
communication failure is also beyond the scope of this paper.

5 http://www.cs.rpi.edu/~baojie/sit/index.php

6 http://en.wikipedia.org/wiki/Mars_Climate_Orbiter

7 http://bit.ly/fhMzUn

For a source state (interpretation) w, an input message x and
an output message y, there are two kinds of semantic errors$^8$:

•  Unsoundness: the sent message is true but the received
message is false, i.e., $w \models x$ but $w \not\models y$;

•  Incompleteness: the sent message is false but the received
message is true, i.e., $w \models y$ but $w \not\models x$.

Some communication tasks may tolerate one kind of error
(e.g., incompleteness) more than the other. In this paper, since
we do not consider lossy source coding, i.e., $w \models x$ is always
true, our goal is to reduce unsoundness, formally stated as:

$$\max \sum_{w \models y} p(w, x, y)$$

where $p(w, x, y)$ is the joint distribution of w, x, y. For a
semantic source, $p(w, x, y) = p(y|w, x)\, p(w, x)$ where

$$p(y|w, x) = p(y|x)$$

since transmission of the message is independent of source
coding. Note that $p(y|x)$ is the semantic channel transition
distribution.

$$p(w, x) = p(x|w)\, p(w)$$

where $p(x|w)$ is determined by the semantic encoder (message
generator), and $p(w)$ is the logical distribution of interpretations.
Thus, our goal is

$$\max \sum_{w \models y} p(y|x)\, p(x|w)\, p(w)$$

Since $p(y|x)$ is determined by the semantic channel, and
$p(w)$ is determined by the source, the goal of semantic channel
coding is to optimize the coding scheme $p(x|w)$, i.e.,
given an observed world, choose the strategy that can best
tolerate noise. For instance, if a voice channel has a high
probability of confusing “p” and “ff”, “copy machine” may
be received as “coffee machine”. Alternatively, assuming that
both sides use “Xerox” as a synonym of “copy machine”,
sending “Xerox” may reduce the chance of misunderstanding.
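The copy-machine example can be turned into a small search over encoders. The sketch below is illustrative only: the two worlds, the satisfaction table `sat`, and the channel confusion probabilities (0.7 and 0.3) are assumptions, not values from the text. It evaluates the soundness objective for each deterministic lossless encoder and picks the best one.

```python
from itertools import product

# Hypothetical source: p(w) over two worlds, and the worlds in which each message is true.
worlds = {"copier": 0.5, "coffee": 0.5}
sat = {
    "copy machine": {"copier"},
    "Xerox": {"copier"},
    "coffee machine": {"coffee"},
}
# Assumed channel p(y|x): "copy machine" and "coffee machine" are confusable.
channel = {
    "copy machine": {"copy machine": 0.7, "coffee machine": 0.3},
    "Xerox": {"Xerox": 1.0},
    "coffee machine": {"coffee machine": 0.7, "copy machine": 0.3},
}

def soundness(encoder):
    """Sum of p(y|x) p(x|w) p(w) over all (w, x, y) with w |= y (deterministic encoder)."""
    return sum(
        p_w * p_y
        for w, p_w in worlds.items()
        for y, p_y in channel[encoder[w]].items()
        if w in sat[y]
    )

def best_encoder():
    """Search deterministic lossless encoders (w |= x always) for maximal soundness."""
    options = [[x for x in sat if w in sat[x]] for w in worlds]
    candidates = [dict(zip(worlds, choice)) for choice in product(*options)]
    return max(candidates, key=soundness)

print(best_encoder(), soundness(best_encoder()))
```

With these assumed numbers, encoding the copier world as “Xerox” is preferred because that message is never confused with one that is false in the observed world.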

Another  way  to  overcome  noise  is  to  introduce  semantic 
redundancy  into  a  message.  For  example,  in  HTML,  an  ‘img’ 
object  (image)  may  have  an  ‘alt’  attribute  which  gives  a  textual 
description  of  the  image  and  will  be  shown  instead  if  the  image 
itself is not transmitted. Note that semantic redundancy may
not necessarily lead to syntactical redundancy. For example,
suppose the topic of communication is weekdays, then the
message “$Mon \vee Tue \vee Wed \vee Thu \vee Fri$” can be reformulated as
a shorter message “$\neg Sat \wedge \neg Sun$”. The two parts of the
reformulated message contain semantic redundancy such that if one
part is lost in transmission, the received message is still sound
(although not semantically equivalent to the original message).
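The weekday example can be checked mechanically. The sketch below assumes that each world is "today is exactly one day of the week"; the names `entails`, `original`, `reformulated`, and `part_received` are hypothetical helpers for illustration.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Assumption: each world fixes exactly one day, so a message is a predicate on days.
original = lambda day: day in {"Mon", "Tue", "Wed", "Thu", "Fri"}
reformulated = lambda day: day != "Sat" and day != "Sun"
part_received = lambda day: day != "Sat"   # the second conjunct was lost in transmission

def entails(f, g):
    """f |= g: every world satisfying f also satisfies g."""
    return all(g(d) for d in DAYS if f(d))

print(entails(original, reformulated), entails(reformulated, original))
print(entails(original, part_received), entails(part_received, original))
```

The truncated message is entailed by the original (so it is sound) but does not entail it, which is the sense in which it is no longer semantically equivalent.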

8 Note that here we implicitly adopted a global semantics assumption, that
is, the sender and the receiver share the same universe of interpretations.
Under certain circumstances, this assumption may not be valid and a local
model semantics [23] may be needed.


B.  Semantic  Channel  Capacity 

Analogous  to  CIT,  a  noisy  semantic  channel  has  a  capacity 
limit  such  that  a  transmission  rate  can  be  achieved  with 
arbitrarily  small  semantic  errors  within  the  limit.  First,  we 
explain  some  notations  to  be  used  in  the  theorem. 

•  $I(X;Y) = H(X) - H(X|Y)$ is the mutual information
between X and Y. It represents syntactical channel
equivocation, which may be a result of technical noise
or non-literal semantic transmission.

•  $H_{K_s,I_s}(W|X)$ is the equivocation of the semantic
encoder, given the sender’s local knowledge $K_s$ and
inference procedure $I_s$. Intuitively, a higher $H_{K_s,I_s}(W|X)$
means higher semantic ambiguity in semantic coding.

•  $\overline{H_{S;K_r,I_r}}(Y) = \sum_y p(y) H_s(y)$ is the average logical
information of received messages, given the receiver’s
local knowledge $K_r$ and inference procedure $I_r$. A higher
$\overline{H_{S;K_r,I_r}}(Y)$ means stronger ability of the receiver to
interpret received messages.

For a simplified model, we assume $K_s = K_r$ and $I_s = I_r$
and omit the subscript. The limit is given in the theorem below:

Theorem 3 (Semantic Channel Coding Theorem): For every
discrete memoryless channel, the channel capacity

$$C_s = \sup_{P(X|W)} \{ I(X;Y) - H(W|X) + \overline{H_S}(Y) \}$$

has the following property: For any $\epsilon > 0$ and $R < C_s$, there
is a block coding strategy such that the maximal probability
of semantic error is $< \epsilon$.

The  argument  of  sup  is  the  semantic  coding  strategy.  A 
proof  sketch  is  given  in  the  appendix.  The  proof  uses  a  strategy 
similar  to  that  used  by  Shannon  [54]  in  deriving  engineering 
channel  capacity,  using  the  Asymptotic  Equipartition  Property 
(AEP). 

Note that semantic channel capacity may be higher or
lower than the engineering channel capacity ($\sup\{I(X;Y)\}$),
depending on whether $\overline{H_S}(Y)$ or $H(W|X)$ is larger. This
implies that using a semantic encoder with low semantic
ambiguity and a semantic decoder with strong inference ability
and/or a large shared knowledge base, we may achieve
high-rate semantic communication using a low-rate engineering
channel.

VI.  Discussions 

The basic model presented in the paper has many limitations.
However, we believe the model is fairly general
for  future  extensions.  Some  such  potential  extensions  have 
been  discussed  in  the  paper.  Here,  we  list  some  additional 
extensions. 

A.  Intended  Messages  and  Expressed  Messages 

It is often the case that people intend to say something, but
due to practical reasons or restrictions in the language, are
not able to express precisely the intended message (see Figure
1). A person in a foreign country with limited knowledge of
the local language, a little child with a small vocabulary, or
an animal trainer who must give instructions using symbols
comprehensible to an animal, are some typical examples. An
intended message is an exact coding of the observed world,
while what is actually expressed, an expression, may or may
not be the same, hence causing a semantic loss.

Sometimes,  the  loss  is  intentional  and  of  practical  value, 
e.g.,  the  loss  caused  by  an  abstract  of  a  paper,  a  real-time 
voice  commentary  of  a  game,  or  transcript  of  a  talk.  In  our 
simplified model, we do not distinguish intended messages
from expressed messages, but studying their relations is
certainly important in the future.

Lossy source coding is the study of finding expressed messages
with minimal expected semantic errors with respect to intended
messages. Such a coding strategy may rely on semantic
similarity measurements between messages. There is an extensive
literature on similarity measures, e.g., [10, 40, 50]. Some
promising candidates include Lin’s similarity measure [40] and
Normalized Compression Distance [58].

B.  First-Order  Languages 

A first-order language may lack the finite model
property, or the finite domain property for its models. Thus,
it may be difficult to evaluate the distribution of its models,
and therefore difficult to obtain the model entropy of a
semantic source with a first-order language as background
knowledge. We also need to consider several additional syntax
constructs:

•  How to handle variables?

•  How to handle quantifiers?

•  What are the logical probabilities of first-order sentences
with/without free variables?

When we talk about the probability of a logical sentence, it
should be noted that there are two types of probabilities [26].
The first type (type-1) is a probability on the domain, which
can be used to give semantics to formulae involving statements
like “for a randomly selected individual in a randomly chosen
model, the chance that this individual is an instance of A is
0.5”. Type-1 probability may be empirically determined by
sampling the domain of possible models without considering
any background knowledge.

The second type (type-2) of probability, which is essentially
what Bar-Hillel and Carnap used in their propositional
version of CSIT, is a probabilistic distribution on possible
worlds, or degree of belief as some authors call it [26].
It involves statements like “The probability that F is true is
0.5” where F is a first-order sentence (i.e., a formula without
free variables), e.g., $\exists x\, A(x)$. F is true in some models, and
false in some others. The likelihood that F is true will
be a statistical measurement over the set of all models, but not
over domains of these models. Evaluating type-2 probability is
related to the probabilistic logics of Nilsson [47], Scott [52],
Gaifman [22] and much of the follow-up work on first-order
logic (e.g., [2, 30]; also see a survey [16]).

In first-order logic, since the domain of models may vary,
we need to know the distribution function $\mu$ of models.
Different assumptions may be made when we work with
different domains. A generic a priori distribution may be the
“algorithmic Solomonoff probability” [29] of models according
to their minimal descriptive length, i.e., their Kolmogorov
complexity. That is, the chance that we choose a model s is
$2^{-K(s)}$, where $K(s)$ is the Kolmogorov complexity of model
s. Intuitively, in this distribution a “simple” model is preferred.
As neither Kolmogorov complexity nor algorithmic probability
is computable, some approximations may be used instead.
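One commonly used computable stand-in for Kolmogorov complexity is the length of a compressed encoding. The sketch below is an assumption-laden illustration, not a method from the paper: it represents each model as a string, uses `zlib` compressed size as a crude proxy for $K(s)$, and normalizes the weights $2^{-\text{length}}$ over a finite candidate set. The function names are hypothetical.

```python
import zlib

def description_length_bits(model: str) -> int:
    """Compressed size in bits: a crude, computable stand-in for K(s)."""
    return 8 * len(zlib.compress(model.encode("utf-8"), 9))

def approximate_prior(models):
    """Normalized weights proportional to 2^(-description length) over the given models."""
    lengths = {s: description_length_bits(s) for s in models}
    shortest = min(lengths.values())
    # Subtract the shortest length before exponentiating to keep weights in range.
    weights = {s: 2.0 ** -(n - shortest) for s, n in lengths.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

regular = "a" * 60
irregular = "q8Zk2!mVx0Lr7@pTn5#cYd1Ws9$eUf3Gh6%jBi4^oNa&lMb*zQt(Xg)vRy-uHw+K"
prior = approximate_prior([regular, irregular])
print(prior[regular] > prior[irregular])
```

The highly regular string compresses to far fewer bits and therefore receives almost all of the prior mass, which is the "simple models are preferred" behavior described above. A general-purpose compressor only upper-bounds description length, so this remains a rough approximation.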

In some domains, we may use assumptions about the model
distributions based on properties of models. For example, if
the models are sets of tweets, then one possible distribution is
by their authors. Another possible assumption is by the size
of models. For instance, the size of cities follows the rank-size
distribution (Zipf’s law) [21], and the number of some animals
in a region follows the Poisson distribution.

Extending  SIT  to  first-order  language  would  help  connect 
this  area  to  the  Semantic  Web  [7],  as  Semantic  Web  languages 
such  as  RDF  [6],  OWL  [5]  and  RIF  [8]  are  rooted  in 
some variations of first-order languages. For instance, efficient
information-theoretical algorithms for ontology compression,
ontology mapping, and ontology transmission may be discovered
as a result.

C.  Semantic  Misinformation 

When  a  message  or  knowledge  base  is  not  consistent,  it  may 
still  carry  some  useful  information,  but  may  also  carry  some 
misinformation.  Both  information  and  misinformation  should 
be measured. In CSIT, Bar-Hillel and Carnap did not separate
the two notions, hence producing the Bar-Hillel-Carnap Paradox
(BCP). Bar-Hillel and Carnap commented on this issue
in their paper:

“It might perhaps, at first, seem strange that a
self-contradictory sentence, hence one which no ideal
receiver  would  accept,  is  regarded  as  carrying  with 
it  the  most  inclusive  information.  It  should,  however, 
be  emphasized  that  semantic  information  is  here  not 
meant  as  implying  truth.  A  false  sentence  which 
happens  to  say  much  is  thereby  highly  informative  in 
our  sense.  Whether  the  information  it  carries  is  true 
or  false,  scientifically  valuable  or  not,  and  so  forth, 
does  not  concern  us.  A  self-contradictory  sentence 
asserts  too  much;  it  is  too  informative  to  be  true.” 

[9] 

BCP may lead to counterintuitive consequences or practical
difficulties in applications. For instance, suppose we have a
large knowledge base$^9$:

$$A_1 \wedge A_2 \wedge \dots \wedge A_k$$

for  a  very  large  k.  If  we  add  a  “small”  inconsistency  to  the 
knowledge  base,  such  as 

$$A_1 \wedge A_2 \wedge \dots \wedge A_k \wedge \neg A_1$$

then suddenly the knowledge base becomes (trivially) most
informative. As a large knowledge base (e.g., the Web) is very
likely to be inconsistent, the applicability of CSIT would be
limited due to BCP.

9 We use “knowledge base” to refer to information sources or messages,
depending on the context where it is used.

Another issue with BCP is that it does not distinguish between
contradictions of different kinds. For instance, one may expect
that $(A \wedge \neg A) \wedge (B \wedge \neg B)$ is “worse” than $A \wedge \neg A$. However,
in CSIT, both of them have the same maximum amount of
information. As such, CSIT is not able to measure
misinformation.
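The effect can be reproduced with a few lines of code. The sketch below assumes a uniform distribution over worlds and uses the classical measure $\mathrm{inf}(x) = -\log_2 m(x)$, where $m(x)$ is the logical probability of x; the function names are hypothetical.

```python
import math
from itertools import product

def logical_probability(formula, n_vars):
    """m(x): fraction of equiprobable worlds in which the formula is true."""
    worlds = list(product([False, True], repeat=n_vars))
    return sum(bool(formula(*w)) for w in worlds) / len(worlds)

def information(formula, n_vars):
    """inf(x) = -log2 m(x); infinite when the formula has no model."""
    m = logical_probability(formula, n_vars)
    return math.inf if m == 0 else -math.log2(m)

one_contradiction = lambda a, b: a and not a
two_contradictions = lambda a, b: (a and not a) and (b and not b)
consistent = lambda a, b: a and b

print(information(one_contradiction, 2), information(two_contradictions, 2))
print(information(consistent, 2))
```

Both contradictions receive the same unbounded value, while the consistent message $A \wedge B$ carries 2 bits, so the measure cannot rank the two contradictions against each other.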

Solutions to BCP include assigning to all inconsistent cases
the same infinite information value [42], excluding inconsistent
cases [31], assigning to all inconsistent cases the same
zero information [1, 43], and measuring information based on
truthlikeness [15, 19].

To measure semantic misinformation, we plan to extend
the model-theoretical semantics we studied to a paraconsistent
semantics, e.g., the Logic of Paradox (LP) [48]. D’Alfonso [15]
first proposed this approach. However, he does not distinguish
information and misinformation. We also note that the
LP semantics is not the only choice for handling inconsistent
knowledge. There is a large body of work on measuring
incoherence or inconsistency in logics, e.g., approaches using
alternative semantics [18] [62] [27], operational semantics
[36], and other approaches [28] [49] [24]. The LP semantics is
preferred due to its relative simplicity and because it is a natural
extension of our model-theoretical approach$^{10}$.

D.  Semantic  Mismatch 

Semantic mismatch may arise for different reasons. If
the  sender  and  the  receiver  use  different  local  knowledge 
bases  or  inference  procedures,  a  received  message  may  not 
be  interpreted  as  intended. 

Another potential cause of semantic mismatch is that the
sender and the receiver do not share the same universe of
interpretations. A local model semantics [23], which has been
widely used in studying semantic differences due to
contextuality and modularity in knowledge bases, could be
needed for this case.

Different model distributions may also lead to semantic
mismatches. In extensions of CIT, if the sender and the
receiver do not have an agreed symbol distribution, errors in
decoding are almost unavoidable [32]. We plan to study whether
we can extend [32] to address semantics.

E.  Semantic  Noises 

Semantic noises are different from technical noises, which
can be modeled as a random process. Typical technical noise
patterns, e.g., additive white Gaussian noise (AWGN), are
useful for gaining insight into the behavior of a noisy
communication channel before we consider other more complicated
reasons for communication interference. We are going to
extend our framework to study questions like: Are there typical
semantic noise patterns that we can precisely model? How is
semantic noise related to semantic mutual information? How do
such noise patterns affect lossy communication?

10 Many properties of classical logic still hold in the LP semantics, e.g., De
Morgan's laws. However, modus ponens does not hold in the LP semantics.

F.  Relation to Algorithmic Information Theory

Another  related  area  of  research  is  Algorithmic  Information 
Theory  (AIT)  [11]  and  Kolmogorov  Complexity  [38].  It  has 
been  shown  that  Shannon’s  statistical  definition  of  information 
is  closely  related  to  algorithmic  information  as  measured  by 
Kolmogorov complexity [25]. How is SIT related to AIT?
Are there universal semantic coding algorithms (i.e.,
distribution independent) corresponding to universal (syntactical)
coding algorithms studied in AIT? How is resource-bounded
Kolmogorov complexity related to bounded rationality in
communication? We believe investigating these connections
may help us to better understand both areas.

VII.  Conclusion 

In this paper, we presented some initial results of our
investigation into measuring semantic information and semantic
coding. We proposed a model-theoretical framework for
measuring semantic information in information sources and
communication channels.

An  interesting  result  is  that  the  fundamental  theorems 
of  classical  information  theory  have  semantic  counterparts. 
These  theorems  reveal  the  existence  of  some  semantic  coding 
algorithms  for  data  compression  and  reliable  communication. 
However,  as  in  Shannon’s  paper  [54],  these  theorems  do  not 
tell us how to develop optimal coding algorithms. We note that
for both source coding and channel coding, bound-achieving
algorithms could be computationally difficult. Efficient semantic
coding algorithms deserve further investigation.

This  paper  is  intentionally  focused  on  an  abstract  basic 
model  of  semantic  communication  so  that  we  can  focus  on 
the  “core”  issues.  We  will  extend  the  framework  in  future 
work  as  suggested  in  the  discussion  section. 
Appendix

A.  Proofs 

Proof  Sketch  of  the  Semantic  Channel  Coding  Theorem 

To  prove  that  Cs  is  indeed  an  upper  bound  for  error-free 
transmission  rate,  we  use  a  similar  strategy  to  that  used  by 
Shannon  [54]  in  deriving  the  engineering  channel  capacity. 

Shannon’s proof relies on the Asymptotic Equipartition
Property (AEP) ([13], Chapter 3). AEP states that for
independently, identically distributed (i.i.d.) random variables $X_i$,
the probability of observing the sequence $X_1, X_2, \dots, X_n$ is
close to $2^{-nH(X)}$. Thus, the set of all possible sequences can
be divided into typical sets where the sample entropy is close
to the entropy of individual variables, and other non-typical
sets with low probabilities. We only need to discuss typical
sets, and their properties are true with high probabilities for
all sequences.
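The AEP statement can be observed empirically. The sketch below is a simple illustration under assumed parameters (a Bernoulli source with probability 0.3 and a fixed random seed, both chosen here): it draws one long i.i.d. sequence and compares $-\frac{1}{n}\log_2$ of its probability with the source entropy.

```python
import math
import random

def sample_entropy_rate(p_one, n, seed=0):
    """-(1/n) log2 of the probability of one sampled i.i.d. Bernoulli(p_one) sequence."""
    rng = random.Random(seed)
    log_prob = 0.0
    for _ in range(n):
        bit = rng.random() < p_one
        log_prob += math.log2(p_one if bit else 1.0 - p_one)
    return -log_prob / n

def bernoulli_entropy(p_one):
    """Entropy (bits) of a single Bernoulli(p_one) variable."""
    return -(p_one * math.log2(p_one) + (1 - p_one) * math.log2(1 - p_one))

print(sample_entropy_rate(0.3, 20000), bernoulli_entropy(0.3))
```

For long sequences the two numbers agree closely, which is what allows the argument to restrict attention to typical sequences.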

The  argument  goes  as  follows: 

1)  A  semantic  error  occurs  if  a  received  message  is  not 
entailed  by  the  currently  observed  interpretation  at  the  source. 

2)  Let N be a sufficiently large number. $W = w_1, \dots, w_N$ is
the sequence of observed interpretations. Accordingly, X and
Y are sequences of sent messages and received messages.

3)  According to AEP, there are $2^{H(Y) N}$ typical Y
sequences, and $2^{H(X) N}$ typical X sequences. For each
typical Y sequence, there are $2^{H(X|Y) N}$ possible typical input
message sequences of X.

4)  Let the rate of transmission be R (messages/time unit).
For any possible typical sequence of X, the chance that it
is indeed a message sequence is $2^{(R - H(X)) N}$. For a typical
sequence of Y, there are $2^{(H(X|Y) + R - H(X)) N}$ typical input
messages.

5)  For each typical sequence of X, there are $2^{H(W|X) N}$
typical sequences of interpretations that cause it. For a typical
sequence of Y, the number of typical sequences of
interpretations that cause it is $2^{(H(X|Y) + R - H(X) + H(W|X)) N}$.

6)  For a randomly chosen interpretation w and message y,
the chance that $w \models y$ is $m(y)$. Therefore, for a sequence
of W and a sequence of Y, the chance that each segment of
W is a model of the corresponding segment of Y is
$M = m(y_1) m(y_2) \cdots m(y_N)$. Since
$\log M = \log m(y_1) + \log m(y_2) + \dots + \log m(y_N) = -\sum_i H_s(y_i)$,
$M = 2^{-\overline{H_S}(Y) N}$.


7)  For a typical sequence of Y, the chance that there
is a semantic error, i.e., none of the typical sequences of
interpretations that cause it (via X) entails it, is

$$\left(1 - 2^{-\overline{H_S}(Y) N}\right)^{2^{(H(X|Y) + R - H(X) + H(W|X)) N}}$$

When $N \to \infty$, the above expression approaches

$$1 - 2^{(H(X|Y) + R - H(X) + H(W|X) - \overline{H_S}(Y)) N}$$

If

$$R < R_0 = H(X) - H(X|Y) + \overline{H_S}(Y) - H(W|X)$$

the probability of semantic errors approaches 0.
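The limit in step 7 has the shape of the first-order approximation $(1 - M)^K \approx 1 - KM$, which is accurate when $KM$ is small. The sketch below only checks that approximation numerically; `m` and `k` are hypothetical stand-ins for the per-sequence probability and the number of candidate sequences, and reading the step this way is an interpretation rather than something stated in the text.

```python
import math

def exact(m, k):
    """(1 - m)^k computed stably via log1p, for tiny m and large k."""
    return math.exp(k * math.log1p(-m))

def first_order(m, k):
    """Approximation 1 - k*m, reasonable only when k*m is small."""
    return 1.0 - k * m

# Illustrative magnitudes: m = 2^-40 and k = 2^20, so k*m = 2^-20.
m, k = 2.0 ** -40, 2.0 ** 20
print(exact(m, k), first_order(m, k))
```

The two values differ by roughly $(KM)^2/2$, so the approximation tightens as the exponent of $KM$ becomes more negative.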

8)  If the average error of all possible semantic coding
strategies can approach 0 below transmission rate $R_0$, there
must exist a semantic coding algorithm that is better than the
average performance.